fix: chunked BF16, buffer cap, drop fake FMA #55

Merged
AdaWorldAPI merged 1 commit into master from claude/bf16-chunked-review on Mar 30, 2026

@AdaWorldAPI
Owner

Addresses review items from the BF16-direct rebase:

Fix 1: mul_add with zero addend

sums[bin].mul_add(splat(scale), splat(0.0)) → sums[bin] * splat(scale)

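An FMA with a zero addend wastes the fused add: it is no faster than a plain multiply, needs an extra splat(0.0) operand, and on cores with separate multiply and FMA ports it occupies the FMA port instead of the multiply port.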
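As a minimal sketch of why the two forms are interchangeable here, the scalar functions below compare a plain multiply with an FMA whose addend is zero. The names `scale_mul` and `scale_fma_zero` are hypothetical, and scalar `f64` stands in for the SIMD type used in the PR; only `f64::mul_add` is a real standard-library call.

```rust
/// Scale every bin with a plain multiply (the form this PR switches to).
pub fn scale_mul(sums: &mut [f64], scale: f64) {
    for s in sums.iter_mut() {
        *s *= scale;
    }
}

/// Scale every bin with a fused multiply-add whose addend is zero
/// (the form being removed). The "+ 0.0" contributes nothing numerically.
pub fn scale_fma_zero(sums: &mut [f64], scale: f64) {
    for s in sums.iter_mut() {
        *s = s.mul_add(scale, 0.0);
    }
}
```

Both functions round the product once, so they produce equal values. The one observable difference is the sign of a zero result: a product of -0.0 becomes +0.0 once +0.0 is added, which is unlikely to matter for scaled sums.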

Fix 2+3: Chunked row-batch reading + buffer cap

Before: read_tensor_bf16_raw allocates the full tensor as a single Vec<u16>. For ffn_gate_exps (128 experts × 5120 × 13824) that's 10.7 GB.

After: read in row batches capped at MAX_BUF_ELEMS (64M u16 = 128 MB). Each batch is read, projected, and appended to the results before the next one starts, so the buffer never exceeds 128 MB regardless of tensor size.

chunk_rows = min(n_rows, MAX_BUF_ELEMS / n_cols)
           = min(655360, 64M / 13824)
           = min(655360, 4629)
           = 4629 rows per chunk (~122 MiB, just under the 128 MB cap)

That is 142 chunks for the largest tensor, each fully processed before the next read. Peak buffer memory stays at ~128 MB instead of 10.7 GB.
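The arithmetic above can be sketched as two small helpers. This is an illustration under stated assumptions, not the PR's actual code: the function names are hypothetical, `MAX_BUF_ELEMS` is taken as the decimal value 64,000,000 because that is what reproduces 4629 rows, and the 8-row floor comes from the commit message.

```rust
/// Assumed cap: 64M u16 elements (128 MB). The decimal value is an
/// assumption chosen because it reproduces the 4629-row figure.
pub const MAX_BUF_ELEMS: usize = 64_000_000;

/// Assumed floor: 8 rows, one F64x8 SIMD width.
pub const MIN_CHUNK_ROWS: usize = 8;

/// Rows per batch: as many whole rows as fit in `max_elems`, but at least
/// MIN_CHUNK_ROWS (which can exceed the cap for very wide rows) and never
/// more than the tensor has.
pub fn chunk_rows(n_rows: usize, n_cols: usize, max_elems: usize) -> usize {
    if n_rows == 0 || n_cols == 0 {
        return 0;
    }
    (max_elems / n_cols).max(MIN_CHUNK_ROWS).min(n_rows)
}

/// Number of batches needed to cover `n_rows` (ceiling division).
pub fn chunk_count(n_rows: usize, chunk_rows: usize) -> usize {
    if chunk_rows == 0 {
        return 0;
    }
    (n_rows + chunk_rows - 1) / chunk_rows
}
```

For the tensor in the description, `chunk_rows(128 * 5120, 13824, MAX_BUF_ELEMS)` gives 4629 and `chunk_count(128 * 5120, 4629)` gives 142; one full batch is 4629 × 13824 × 2 = 127,982,592 bytes.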

Calling shrink_to(MAX_BUF_ELEMS) after an oversized tensor releases any excess capacity, so the buffer does not persist at an inflated size.
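A rough sketch of the read, project, extend loop with a reusable buffer might look like the following. Everything here is an assumption for illustration: `read_chunked`, its parameters, and the `project` callback are hypothetical, the data is assumed to be little-endian u16, and the temporary byte buffer is a simplification that the real code may avoid. Only `Read::read_exact` and `Vec::shrink_to` are real standard-library calls.

```rust
use std::io::{self, Read};

/// Assumed floor on batch size (8 rows, one F64x8 SIMD width).
const MIN_CHUNK_ROWS: usize = 8;

/// Read `n_rows * n_cols` little-endian u16 values (BF16 bit patterns) in
/// row batches, pass each batch to `project`, and collect its output.
/// `buf` is reused across batches and across calls.
pub fn read_chunked<R: Read, T>(
    reader: &mut R,
    n_rows: usize,
    n_cols: usize,
    max_buf_elems: usize,
    buf: &mut Vec<u16>,
    mut project: impl FnMut(&[u16], usize) -> Vec<T>,
) -> io::Result<Vec<T>> {
    let mut results = Vec::new();
    if n_rows == 0 || n_cols == 0 {
        return Ok(results);
    }
    let chunk_rows = (max_buf_elems / n_cols).max(MIN_CHUNK_ROWS).min(n_rows);
    // Simplification: stage raw bytes separately, then decode into `buf`.
    let mut bytes = vec![0u8; chunk_rows * n_cols * 2];
    let mut row = 0;
    while row < n_rows {
        let rows = chunk_rows.min(n_rows - row); // last batch may be short
        let n = rows * n_cols;
        reader.read_exact(&mut bytes[..n * 2])?;
        buf.clear();
        buf.extend(
            bytes[..n * 2]
                .chunks_exact(2)
                .map(|b| u16::from_le_bytes([b[0], b[1]])),
        );
        // Each batch is fully processed before the next read.
        results.extend(project(&buf[..], n_cols));
        row += rows;
    }
    // Release any capacity above the cap so the buffer does not stay inflated.
    buf.shrink_to(max_buf_elems);
    Ok(results)
}
```

Passing the buffer in by mutable reference lets one allocation serve every tensor in a file, which is why the shrink step matters: without it, a single oversized batch would leave the capacity raised for all later tensors.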

Bonus: progress logging

Large tensors now print ... 4629/655360 rows (0.7%) per chunk so you see activity during multi-minute reads.
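The per-chunk progress text could be produced by a helper along these lines. The function name `progress_line` is hypothetical, and any prefix the real log line carries is left out; the sketch covers only the rows-and-percentage part shown above.

```rust
/// Format "done/total rows (pct%)" with one decimal place.
/// An empty tensor is reported as 100.0% to avoid dividing by zero.
pub fn progress_line(rows_done: usize, n_rows: usize) -> String {
    let pct = if n_rows == 0 {
        100.0
    } else {
        rows_done as f64 * 100.0 / n_rows as f64
    };
    format!("{}/{} rows ({:.1}%)", rows_done, n_rows, pct)
}
```

After the first chunk of the largest tensor, `progress_line(4629, 655360)` yields `4629/655360 rows (0.7%)`.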

What's unused

read_tensor_bf16_raw() is no longer called in stream_index_gguf_bf16 (replaced by inline chunked reads). Kept for potential test use.

1. mul_add with zero addend → plain multiply (was wasting an FMA slot)
2. Chunked row-batch reading for BF16 tensors: caps buffer at 128 MB
   regardless of tensor size. A 10.7 GB ffn_gate_exps reads in batches
   of ~4.6K rows instead of one 10.7 GB allocation. Minimum batch = 8
   rows (one F64x8 SIMD width).
3. Buffer shrink_to after oversized tensors: bf16_buf capacity is shrunk
   back to MAX_BUF_ELEMS (64M u16 = 128 MB) if it somehow grew past that.
4. Progress logging within large tensors: prints row count every chunk
   so you see activity during multi-minute tensor reads.

read_tensor_bf16_raw() is now unused in the main path (kept for
potential direct use in tests or smaller models).
@AdaWorldAPI merged commit 0168de9 into master on Mar 30, 2026
5 of 14 checks passed